
From Text to Embeddings: Understanding NLP Representations

A conceptual journey from raw text to learned representations


πŸ€”

The Core Question

"How do machines understand human language?"

The Journey We'll Take

graph LR A["πŸ“ Raw Text
'I love ML'"] --> B["πŸ”€ Tokens
['I', 'love', 'ML']"] B --> C["πŸ”’ Numbers
[245, 1089, 3421]"] C --> D["πŸ“Š Embeddings
[0.2, -0.5, 0.8, ...]"] D --> E["πŸ€– Model
Understanding"] E --> F["✨ Output
Predictions"] style A fill:#F3F2F1,stroke:#0078D4,stroke-width:2px style B fill:#FFF9C4,stroke:#FBC02D,stroke-width:2px style C fill:#FFE0B2,stroke:#F7630C,stroke-width:2px style D fill:#E1BEE7,stroke:#9C27B0,stroke-width:2px style E fill:#0078D4,stroke:#0078D4,stroke-width:2px,color:#fff style F fill:#107C10,stroke:#107C10,stroke-width:2px,color:#fff

This module demystifies each step of this transformation

What You'll Master

🎯

The Why

Why models need numbers

πŸ“ˆ

The Evolution

From BoW to Transformers

βš™οΈ

The Mechanics

How tokenizers work

🎨

The Choices

When to use what

πŸ”§

The Practice

Hands-on workflows

πŸš€

The Integration

Connecting to LLMs

Your Learning Path (7 Parts)

1
🎯

Foundation

Why do models need numbers? What makes text unique?

3 sections
2
πŸ“ˆ

Evolution

BoW β†’ TF-IDF β†’ Word2Vec β†’ Contextual Embeddings

4 sections
3
βš™οΈ

Mechanics

How embeddings emerge, tokenization deep dive

4 sections
4
🎨

Decisions

When to use embeddings? Build vs pretrained?

2 sections
5
πŸ“‹

Guidelines

What to watch out for, recommended workflows

2 sections
6
πŸ”—

Integration

Complete pipeline, connecting to modern LLMs

2 sections
7
πŸ’»

Practice

Hands-on Jupyter notebook with real data

1 section
1 β†’ 2 β†’ 3 β†’ 4 β†’ 5 β†’ 6 β†’ 7

Follow the path to master text representations

πŸ‘‡ Let's Start with Part 1: Foundation

Part 1: The Foundation

The 3 Questions We'll Answer

πŸ”’

Why Numbers?

Why do models need numeric inputs?

πŸ“

Why Is Text Different?

What makes NLP uniquely challenging?

❓

The Core Problem?

What challenge must we solve?

Your Path Through Part 1

graph LR
    S1["Section 1<br/>🎯 What Is a Model?<br/>Models need numbers"] --> S2["Section 2<br/>🔄 Features vs Representations<br/>Text is unique"]
    S2 --> S3["Section 3<br/>❓ The Text Problem<br/>The challenge defined"]
    style S1 fill:#E3F2FD,stroke:#0078D4,stroke-width:3px
    style S2 fill:#FFF9C4,stroke:#FBC02D,stroke-width:2px
    style S3 fill:#F3E5F5,stroke:#9C27B0,stroke-width:2px
☁️

Real Examples
Weather, Spam, Recommendations

πŸ“Š

Visual Comparisons
Structured vs Images vs Text

πŸ’‘

Key Insights
The double learning problem

πŸ‘‡ Section 1: What Is a Model?

Section 1: What Is a Model?

πŸ’‘

Before we dive into text, let's understand what models actually do.

πŸ”‘ The Critical Insight

Machine learning models must work with numbers because they use mathematical operations (multiplication, addition, derivatives) to learn patterns. Text like "hot" or "sunny" cannot be directly multiplied or differentiated.

Numbers vs Text Comparison

Left: With numbers, models can perform math operations and learn. Right: Text cannot be used in mathematical operations directly.

Let's see this principle in action with three everyday examples:

☁️
Weather Prediction
πŸ“₯ Inputs (All Numbers)
  • 🌑️ Temperature: 23Β°C
  • πŸ’§ Humidity: 65%
  • 🎚️ Pressure: 1013 hPa
↓

Model processes these numbers

↓
πŸ“€ Output (A Number)

Rain Probability: 0.7 (70% chance)

βœ‰οΈ
Spam Detection
πŸ“₯ Inputs (All Numbers)
  • πŸ“§ Word count: 45
  • πŸ’Έ Has "urgent": 1
  • πŸ”— Link count: 3
↓

Model processes these numbers

↓
πŸ“€ Output (A Number)

Spam Score: 0.9 (90% spam)

⭐
Product Recommendation
πŸ“₯ Inputs (All Numbers)
  • πŸ‘€ User age: 28
  • πŸ›’ Past purchases: 15
  • ⏱️ Time on page: 120s
↓

Model processes these numbers

↓
πŸ“€ Output (A Number)

Interest Score: 0.85 (85% likely)

What Do All These Have in Common?

graph LR A["πŸ“Š Numeric Inputs
(temperatures, counts, ages)"] --> B["πŸ”’ Mathematical Operations
(multiply, add, gradients)"] B --> C["πŸ“ˆ Numeric Outputs
(probabilities, scores)"] style A fill:#E3F2FD,stroke:#0078D4,stroke-width:2px style B fill:#0078D4,stroke:#0078D4,stroke-width:2px,color:#fff style C fill:#107C10,stroke:#107C10,stroke-width:2px,color:#fff

The Pattern: All models follow the same principle:
Numbers In β†’ Math Processing β†’ Numbers Out
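The pattern can be sketched as a tiny toy model for the weather example: numeric inputs go through arithmetic and come out as a probability. The weights here are hypothetical, hand-picked for illustration; a real model would learn them from data.

```python
import math

def predict_rain(temp_c, humidity_pct, pressure_hpa):
    # weighted sum of numeric inputs (weights are invented for illustration;
    # training would learn them via gradient descent)
    z = 0.02 * temp_c + 0.05 * humidity_pct - 0.004 * pressure_hpa
    # squash to a 0..1 probability with the logistic function
    return 1 / (1 + math.exp(-z))

p = predict_rain(23, 65, 1013)  # numbers in -> number out
```

Note that every step is arithmetic: multiply, add, exponentiate. None of these operations is defined on the raw strings "warm" or "humid".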

🧠 Deep Dive: How Models Learn From Numbers

Models don't just process numbersβ€”they learn patterns by adjusting parameters through gradient descent.

Why Numbers Enable Learning

Left: A model finding patterns in numeric data (temperature β†’ rain probability). Right: Gradient descent optimizing model parametersβ€”requires numeric derivatives!

Key Point: Learning requires computing gradients (derivatives), which only work with numbers. This is why "converting text to numbers" isn't just preprocessingβ€”it's the fundamental bridge that enables learning.
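To make the "numeric derivatives" point concrete, here is a minimal gradient-descent loop on a one-parameter model y = w·x with squared-error loss (a sketch, not any particular library's API). Every line is arithmetic on numbers; none of it would be possible on the string "sunny".

```python
def loss(w, x, y):
    return (w * x - y) ** 2          # squared error

def grad(w, x, y):
    return 2 * x * (w * x - y)       # d(loss)/dw -- requires numbers, not strings

w = 0.0                              # start from a bad guess
for _ in range(100):
    w -= 0.1 * grad(w, x=1.0, y=0.7) # step against the gradient

# w has moved close to the value (0.7) that makes the loss zero
```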

πŸ’‘

Key Takeaway

All machine learning models work the same way: They take numbers as input, process them mathematically, and produce numbers as output.

Whether it's weather, spam, or recommendationsβ€”models only understand numbers. This is not a limitation, it's how they learn patterns from data.

πŸ€” Check Your Understanding

Q1: Why do machine learning models require numeric inputs?

βœ… Correct: Gradient descent operates on mathematical functions that need numeric parameters
❌ Wrong: Computers can only store numbers
❌ Wrong: Text takes too much memory
πŸ’‘ Explanation: Models learn by computing gradients (derivatives) to update parameters. This mathematical process requires numeric valuesβ€”you cannot take the derivative of "sunny" or "cat"!

Q2: What makes NLP uniquely challenging compared to structured data?

βœ… Correct: We must learn both representations AND task relationships (double learning problem)
❌ Wrong: Text has more noise than numbers
❌ Wrong: Natural language is ambiguous
πŸ’‘ Explanation: With structured data, features are given (age, price, etc.). With text, we must first learn how to represent words as numbers, THEN learn the task. That's two learning problems!

βœ… Great! You now understand that models need numbers.
But here's the critical question: Are all data types equally easy to convert to numbers?

Section 2: The Unique Challenge of Text - Features vs Representations

πŸ€”

Text is fundamentally different from other data types. Let's see why.

Comparing Three Data Types

Understanding how different data types are processed helps us see why NLP is unique

Three Data Types Comparison

Top (Blue): Structured data features are given. Middle (Orange): Image features are learned implicitly by CNNs. Bottom (Purple): Text representations MUST be explicitly learnedβ€”this is the critical difference!

πŸ”₯ The Critical Difference: Double Learning Problem

πŸ“Š
Structured Data

Features given
β†’ Learn relationships

πŸ–ΌοΈ
Images

Pixels given
β†’ CNN learns features
β†’ Learn relationships

πŸ“
Text/NLP

Symbols given
β†’ MUST learn representations
β†’ Learn relationships

Double Learning Problem

Left: Structured data has ONE learning problem (relationships). Right: Text has TWO learning problems (representations + relationships)β€”this is unique to NLP!

πŸ’‘ Why This Matters: This double learning problem is why preprocessing and representation choices are so critical in NLP. Choose the wrong representation β†’ the model can't learn the task effectively, no matter how sophisticated it is!

Two Approaches: Who Decides the Numbers?

Feature Engineering vs Representation Learning

Left (Orange): Feature Engineeringβ€”YOU manually design what numbers to extract. Right (Purple): Representation Learningβ€”the MODEL automatically learns the best numbers.

🧠 Feature Engineering

Who decides: Human expert

Output: Sparse vectors (mostly zeros)

Example: [5, 1, -1, 0]

You manually count words, negations, etc.

πŸ€– Representation Learning

Who decides: Learning algorithm

Output: Dense vectors (all values meaningful)

Example: [-0.23, 0.45, ..., 0.92] (768D)

Model learns patterns from data automatically
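The two output styles can be illustrated with toy vectors (a sketch; the dense values are invented, since a real model would learn them):

```python
# feature engineering: a sparse count vector over a tiny toy vocabulary
vocab = ["good", "bad", "movie", "plot", "acting"]
text = "good movie good plot"
sparse = [text.split().count(w) for w in vocab]  # mostly zeros at real vocab sizes

# representation learning: a dense vector where every value carries signal
# (these floats are invented for illustration)
dense = [-0.23, 0.45, 0.12, -0.87, 0.92]
```

With a realistic 50,000-word vocabulary, the sparse vector would be 50,000 entries long and almost entirely zeros, while the dense vector stays at a fixed, small size.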

Why This Matters for NLP

1️⃣ Double Learning Challenge

We're learning two things simultaneously: What are good representations? (embedding layer) + How to use them? (task layers)

2️⃣ Quality is Critical

Bad representations β†’ model can't learn well
Good representations β†’ model learns easily

3️⃣ YOU Must Decide

Unlike structured data (features given) or images (CNN handles it), in NLP YOU must choose how to convert text to numbers

Quick Comparison Across Domains

πŸ“Š
Structured Data

Input: Numbers

Features: Given by data

Learns: Relationships only

Challenge: Algorithm choice

πŸ–ΌοΈ
Images

Input: Pixels (numbers)

Features: CNN extracts automatically

Learns: Features + Relationships

Challenge: CNN architecture

πŸ“
Text/NLP

Input: Symbols (text)

Features: MUST learn explicitly

Learns: Reps + Features + Relations

Challenge: Learning representations

πŸ’‘

Why we spend so much time on text-to-numbers:

Unlike structured data (features given) or images (conv handles it), in NLP YOU must decide how to convert text to numbers.

  • Choose wrong representation β†’ model fails (can't learn)
  • Choose right representation β†’ model succeeds (learns patterns)

This module teaches you to make that choice wisely.

The rest of the module answers: "What are good representations and how do we create them?"

Section 3: The Text Problem

❓

Now we know models need numbers and text is uniquely challenging. So what's the actual problem?

The Challenge

Consider this movie review:

"This movie was absolutely fantastic! The acting was superb and the plot kept me engaged throughout."

πŸ€”

How do we give this to a model?

graph LR A["πŸ“ Text:
'fantastic movie...'"] --> B["❓❓❓
Convert to numbers?"] B --> C["πŸ€– Model
(needs numbers)"] C --> D["βœ… Positive/
❌ Negative"] style A fill:#FFF9C4,stroke:#F7630C,stroke-width:2px style B fill:#FFF100,stroke:#D13438,stroke-width:3px style C fill:#0078D4,stroke:#0078D4,stroke-width:2px,color:#fff style D fill:#107C10,stroke:#107C10,stroke-width:2px,color:#fff

The ??? represents the critical challenge we need to solve!

πŸ”’
Models Need Numbers

For gradient descent and learning

πŸ“
Text Is Symbolic

Words, letters, punctuationβ€”not numbers

πŸŒ‰
Need a Bridge

Text β†’ Numbers

⚠️ The Critical Constraint

The numbers we create must preserve meaning. If we lose meaning in the conversion, the model can't learn useful patternsβ€”no matter how sophisticated it is!

Good vs Bad Representation

Left (Bad): ASCII codes lose all semantic meaningβ€”model can't learn. Right (Good): Semantic embeddings preserve meaningβ€”model can learn patterns.
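A quick way to see why "not all numbers are equal": ASCII codes are numbers, but closeness in code space says nothing about closeness in meaning.

```python
def ascii_codes(word):
    # convert each character to its ASCII/Unicode code point
    return [ord(c) for c in word]

# "cat" and "car" differ by one code unit, but so would "cat" and "bat".
# Numeric closeness of codes does not track semantic closeness.
print(ascii_codes("cat"))  # [99, 97, 116]
print(ascii_codes("car"))  # [99, 97, 114]
```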

πŸ’‘ Key Insight: Not all numbers are equal! The quality of your text-to-number conversion determines everything downstream.

🎯 The Question We Must Answer

"How do we convert language into numbers?
And which numbers?
There are infinite ways to do thisβ€”which approach makes sense?"

The Text-to-Numbers Problem: Possible Approaches

Text → Numbers: possible approaches, each with strengths (+) and weaknesses (−):

  • Bag of Words (count): + simple, − loses word order!
  • TF-IDF (weighted counts): + informative, − still sparse
  • Word2Vec (dense): + semantic, − static
  • Contextual (BERT): + context-aware, − expensive
  • Character-level: + fine-grained, − too granular, complex
  • Subword (BPE): + balanced, − complex
  • N-grams (sequences): + expressive, ± mixed informativeness
  • Hashing (encoding): + fast, − collisions

Each approach has trade-offs—there is no single 'best' solution!

Multiple approaches exist: Bag of Words, TF-IDF, Word2Vec, BERT, Character-level, Subword, N-grams, Hashingβ€”each with trade-offs!

πŸ”₯ Why This Matters

❌ Bad Representation

Model can't learn patterns, no matter how good the architecture

βœ… Good Representation

Model learns easily, even with simple architecture

This isn't just preprocessingβ€”it's the foundation of everything in NLP.
The rest of this module teaches you which representations to choose and why.

πŸš€

Part 1 Complete!

You now understand: WHY models need numbers, WHY text is uniquely challenging, and WHAT problem we're solving.

πŸ‘‰ Next: Let's see HOW this problem has been solved over time!

Part 2: The Evolution Story

From Simple Counting to Semantic Understanding

Now that we know why we need to convert text to numbers, let's explore how this problem has been solved over the past 30+ years. Each approach built on the limitations of the previous one.

The 30-Year Evolution Timeline

graph LR
    A["1990s-2000s<br/>📊 Bag of Words<br/>Simple counting"] --> B["2000s<br/>⚖️ TF-IDF<br/>Smart weighting"]
    B --> C["2013+<br/>🧠 Word2Vec/GloVe<br/>Semantic vectors"] 
    C --> D["2018+<br/>🤖 BERT/Contextual<br/>Context-aware"]
    D --> E["2020+<br/>🚀 Modern LLMs<br/>Same principles, more sophisticated"]
    style A fill:#E3F2FD,stroke:#0078D4,stroke-width:3px
    style B fill:#FFE0B2,stroke:#F7630C,stroke-width:2px
    style C fill:#F3E5F5,stroke:#9C27B0,stroke-width:2px
    style D fill:#C8E6C9,stroke:#107C10,stroke-width:2px
    style E fill:#FFECB3,stroke:#FBC02D,stroke-width:2px

Each era solved specific problems but introduced new challenges

Your Journey Through Part 2

πŸ“Š
Section 4: Bag of Words

The simplest approach - just count!

Era: 1990s-2000s

Type: Sparse, discrete counts

βš–οΈ
Section 5: TF-IDF

Smarter counting with weights

Era: 2000s

Type: Weighted sparse vectors

🧠
Section 6: Word2Vec/GloVe

The semantic leap - dense vectors

Era: 2013+

Type: Dense, semantic embeddings

πŸ€–
Section 7: Contextual (BERT)

Context matters - dynamic meaning

Era: 2018+

Type: Contextual embeddings

⚠️ Important: What You'll Learn

βœ“ Each approach has trade-offs - no single "best" solution

βœ“ Embeddings are learned from data, not magic

βœ“ Vector arithmetic (king-queen) works for Word2Vec but NOT universally

βœ“ Modern LLMs use the same core principles, just more sophisticated

πŸ‘‡ Section 4: Bag of Words - The Simplest Approach

Section 4: The Simplest Approach - Bag of Words

πŸ“Š

Let's start with the most intuitive idea: just count the words!

The Core Idea

Bag of Words treats text as an unordered collection of words. We simply count how many times each word appears. It's like throwing all the words into a bag, forgetting their order, and counting them.

Bag of Words: From Text to Count Matrix

Input Text: "I love NLP"
  → tokenize →
Tokens: ["I", "love", "NLP"]
  → build vocab →
Vocabulary: {"I": 0, "love": 1, "NLP": 2}
  → count →
Count Vector: {"I": 1, "love": 1, "NLP": 1}

Corpus: ["I love NLP", "I love ML"]

Result: Each doc becomes a vector of word counts
→ Sparse, high-dimensional, but interpretable

The Step-by-Step Process

Step 1: Tokenization

Split text into individual words (tokens)

"I love NLP" β†’ ["I", "love", "NLP"]
Step 2: Build Vocabulary

Collect all unique tokens from the entire corpus

{"I": 0, "love": 1, "NLP": 2, "ML": 3, ...}
Step 3: Count Occurrences

For each document, count how many times each vocab word appears

Document vector: [1, 1, 1, 0, 0, ...]
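The three steps above can be sketched in plain Python before reaching for a library (a minimal sketch with naive whitespace tokenization; a real vectorizer also handles punctuation and casing rules):

```python
from collections import Counter

docs = ["I love NLP", "I love ML"]

# Step 1: tokenization (naive lowercase whitespace split)
tokenized = [doc.lower().split() for doc in docs]

# Step 2: build vocabulary -- every unique token gets a column index
vocab = {w: i for i, w in enumerate(sorted({w for toks in tokenized for w in toks}))}

# Step 3: count occurrences per document
def bow_vector(tokens):
    counts = Counter(tokens)
    return [counts[w] for w in vocab]  # vocab preserves index order

vectors = [bow_vector(toks) for toks in tokenized]
```

Here "I love NLP" becomes [1, 1, 0, 1] over the vocabulary {i, love, ml, nlp}: the position for "ml" is 0 because that word never appears in the document.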

Code Example with sklearn

Python (sklearn CountVectorizer)
from sklearn.feature_extraction.text import CountVectorizer

# Documents
docs = [
    "I love machine learning",
    "I love coding",
    "machine learning is amazing"
]

# Create and fit vectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Vocabulary (sorted alphabetically)
print("Vocabulary:", vectorizer.get_feature_names_out())
# Output: ['amazing' 'coding' 'is' 'learning' 'love' 'machine']

# BoW Matrix (sparse by default, converting to dense for display)
print(X.toarray())
# Output:
# [[0 0 0 1 1 1]    ← Doc 1: "I love machine learning"
#  [0 1 0 0 1 0]    ← Doc 2: "I love coding"
#  [1 0 1 1 0 1]]   ← Doc 3: "machine learning is amazing"

# Notice: Most values are 0 (sparse!)

Strengths vs Limitations

BoW Strengths vs Limitations
⚠️

The Fatal Flaw: "The movie was not good" vs "The movie was good"

BoW produces nearly identical vectors because word order is lost! This is why we needed better approaches.

πŸ’‘

When to Use BoW:

  • Quick baseline for classification tasks (surprisingly effective!)
  • Document similarity with controlled vocabulary
  • When interpretability matters (can see which words drove the decision)
  • Limited computational resources
  • Topic modeling and keyword extraction
πŸ’» In the notebook, we'll implement BoW using scikit-learn's CountVectorizer on the movie reviews dataset and see how well it performs as a baseline.

Section 5: Smarter Counting - TF-IDF

βš–οΈ

Not all words are equally informative. TF-IDF weighs words by importance.

The Problem with Raw Counts

In BoW, common words like "the", "is", "a" get high counts but tell us little. Rare, specific words like "brilliant" or "terrible" are more informative for sentiment analysis.

πŸ”₯

The Core Insight: Frequent words across all documents matter less. Rare but present words matter more.

How TF-IDF Works

TF-IDF = Term Frequency Γ— Inverse Document Frequency

It's a two-part formula that balances local frequency (in the document) with global rarity (across all documents).

TF-IDF Formula Breakdown
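The formula can be sketched in a few lines using the textbook variant (TF = count/length, IDF = log(N/df)). Note this is an illustration: sklearn's TfidfVectorizer uses a smoothed IDF plus normalization, so its numbers will differ.

```python
import math

corpus = [
    ["the", "movie", "was", "brilliant"],
    ["the", "movie", "was", "boring"],
    ["the", "plot", "was", "thin"],
]

def tf(term, doc):
    return doc.count(term) / len(doc)        # local frequency in the document

def idf(term, corpus):
    df = sum(term in doc for doc in corpus)  # documents containing the term
    return math.log(len(corpus) / df)        # global rarity across the corpus

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# "the" appears in every document -> IDF = log(3/3) = 0 -> weight exactly 0
# "brilliant" appears in only one -> high IDF -> high weight
```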

Example Comparison: The Reweighting Effect

BoW vs TF-IDF Weight Comparison

Notice: "brilliant" has only 2 occurrences but gets the highest TF-IDF score (0.89) because it's rare and informative!

The Impact: Why TF-IDF Matters

TF-IDF Impact

BoW vs TF-IDF: Decision Guide

πŸ“Š Use Bag of Words When:
  • You need a quick baseline
  • Vocabulary is small & controlled
  • Speed is critical
  • You want maximum interpretability
  • Simple classification tasks
βš–οΈ Use TF-IDF When:
  • Common words are drowning signal
  • Information retrieval / search systems
  • Document similarity / classification
  • You need better feature quality
  • Keyword extraction tasks

❌ Without TF-IDF

doc1 = "the the the movie"
doc2 = "the the the film"

BoW weights: [3, 1] and [3, 1]
# "the" dominates (75% of features)
# Can't distinguish docs well

Problem: Common words overwhelm signal

βœ… With TF-IDF

doc1 = "the the the movie"
doc2 = "the the the film"

TF-IDF weights: [0.2, 0.8] and [0.2, 0.8]
# "the" downweighted (20%)
# "movie"/"film" emphasized (80%)

Solution: Informative words dominate

πŸ’‘

Key Takeaway

TF-IDF improves on BoW by recognizing that not all words are equally valuable. It amplifies informative words and suppresses common noise.

But both BoW and TF-IDF share a fundamental limitation: they treat words as independent symbols with no semantic relationship.

In the notebook, we'll compare BoW and TF-IDF using scikit-learn's TfidfVectorizer and see which performs better on movie review sentiment classification.

Common Mistake: Data Leakage in Vectorization

The Error:

# WRONG! Fitting on all data
vectorizer = TfidfVectorizer()
all_vectors = vectorizer.fit_transform(all_texts)  # ❌
X_train = all_vectors[:800]
X_test = all_vectors[800:]

Why It's Wrong: The vectorizer sees test data during fit(), learning vocabulary and IDF weights from test set. This leaks information!

The Fix:

# CORRECT! Fit only on train
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # βœ… Fit on train
X_test = vectorizer.transform(test_texts)         # βœ… Transform test

Impact: Data leakage can inflate test accuracy by 5-10%, leading to production failures!

πŸ€” Check Your Understanding

Q1: What's the key difference between BoW and TF-IDF?

βœ… Correct: BoW counts words, TF-IDF weighs them by importance
❌ Wrong: BoW is faster than TF-IDF
❌ Wrong: TF-IDF only works on English text
πŸ’‘ Explanation: TF-IDF improves on BoW by reweighting words based on their rarity across documents. Common words get downweighted, informative words get emphasized!

Q2: When does TF-IDF help most?

βœ… Correct: When common words (the, is, a) are drowning out informative words
❌ Wrong: When you have very little training data
❌ Wrong: When you need real-time predictions
πŸ’‘ Explanation: TF-IDF shines when common "stop words" dominate your BoW vectors. It automatically identifies and downweights these uninformative words, letting the meaningful terms shine through!

Section 6: The Semantic Leap - Dense Word Embeddings

🎯

What if words could be represented as dense vectors that capture meaning?

From Sparse to Dense: The Paradigm Shift

Instead of sparse vectors with mostly zeros (BoW/TF-IDF), embeddings are dense vectors (typically 100-300 dimensions) where every single value is meaningful.

Sparse vs Dense Paradigm
❌ Sparse Vectors Problem
  • Dimension = vocabulary size (50,000+)
  • 99.99% zeros (wasted space)
  • No semantic relationships
  • "good" and "great" are unrelated
βœ… Dense Embeddings Solution
  • Fixed dimensions (100-300)
  • 100% dense (every value matters)
  • Captures semantic meaning
  • "good" and "great" are similar!

The Magic Property: Semantic Similarity

Words with similar meanings have similar vectors! This is the breakthrough that made embeddings revolutionary.

Semantic Clustering

This was IMPOSSIBLE with BoW/TF-IDF! Sparse methods treated all words as equally unrelated. Embeddings capture that "king" and "queen" are similar concepts, while "king" and "dog" are not.

Measuring Similarity: Cosine Similarity

We measure semantic similarity using the cosine similarity between vectors (range: -1 to 1, where 1 means identical).

Python (illustrative pseudocode)
# Cosine similarity between word vectors (range -1 to 1).
# Values below are illustrative, not from a specific model:
similarity("king", "queen")   ≈ 0.72  # High! Similar concepts
similarity("king", "monarch") ≈ 0.68  # High! Synonyms
similarity("king", "apple")   ≈ 0.03  # Low! Unrelated

# BoW/TF-IDF would treat "king-queen" and "king-apple"
# as equally unrelated (both have zero word overlap)
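Cosine similarity itself takes only a few lines to implement. A minimal stdlib sketch with invented toy vectors (real pipelines would use numpy or sklearn's cosine_similarity):

```python
import math

def cosine_similarity(u, v):
    # dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# toy 3-D vectors, invented for illustration
king  = [0.9, 0.8, 0.1]
queen = [0.85, 0.75, 0.2]
apple = [0.1, -0.3, 0.9]

cosine_similarity(king, queen)  # close to 1: vectors point the same way
cosine_similarity(king, apple)  # near 0: vectors point in different directions
```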

Popular Static Embedding Models

Three foundational approaches, each with different training objectives:

🧠
Word2Vec

Training Objective:
Predict context words from target word (Skip-gram) or target from context (CBOW)

Example Model:
GoogleNews-vectors-negative300

Key Insight:
Words appearing in similar contexts get similar vectors

Dimensions: 300
Vocabulary: 3M words

🌐
GloVe

Training Objective:
Factorize global word co-occurrence matrix

Example Model:
glove.6B.300d

Key Insight:
Combines global statistics with local context

Dimensions: 50/100/200/300
Vocabulary: 400K words

⚑
FastText

Training Objective:
Like Word2Vec but with subword n-grams

Example Model:
cc.en.300.bin

Key Insight:
Handles rare/OOV words better using character n-grams

Dimensions: 300
Vocabulary: 2M words

How Are Embeddings Learned?

Embeddings aren't magicβ€”they're learned through training! Let's visualize how the training process works:

How Embeddings Are Learned: The Training Process

1. Context Window: Sentence: "the king rules the land". Target: "king". Context: ["the", "rules"]

2. Lookup: retrieve the current embedding for "king" (300 dims)

3. Predict: use that embedding to predict the context words: ✓ "the", ✓ "rules"

4. Compute Loss: how well did we predict the context? Loss = prediction error. High loss = bad embedding

5. Update Embedding: backpropagation adjusts the vectors (embedding ← embedding − gradient), moving vectors with similar contexts closer together

↻ Repeat millions of times!

The Key Insight

Words appearing in similar contexts get updated in similar ways

"king" and "queen" both appear near: "the ___ ruled", "___ of England"

β†’ Their embeddings become similar through training!

The Core Idea: Words that appear in similar contexts get similar embeddings. "king" and "queen" both appear near phrases like "the ___ ruled" and "___ of England", so their vectors become similar through millions of training iterations!

Vector Arithmetic: The Famous "King - Queen" Example

Static embeddings (especially Word2Vec) show fascinating linear patterns in semantic space:

king − man + woman ≈ queen

What This Means: in vector space, these words combine to transform meaning:

  • Royalty: "king" keeps royal status → still royalty
  • Gender: subtract male, add female → gender transformation
  • Result: female + royalty → "queen"

Other Famous Examples:

Paris βˆ’ France + Germany β‰ˆ Berlin

walking βˆ’ walk + swim β‰ˆ swimming

Why Does This Work?
  • Gender is captured as a vector direction
  • Royalty is preserved through the operation
  • Relationships are encoded as vector offsets
  • Training objective naturally creates these linear patterns
⚠️

Important Caveat:

Vector arithmetic works best with static embeddings like Word2Vec trained on specific objectives. This property does NOT universally transfer to all embedding types!

Use this as intuition-building, not as a guarantee across all models. (We'll see why when we discuss contextual embeddings next.)

When to Use Static Embeddings

βœ…

Good use cases:

  • Similarity search and clustering
  • Lightweight text classification with limited data
  • Feature extraction for downstream models
  • Fast inference requirements
In the notebook, we'll load pre-trained Word2Vec embeddings using gensim and explore similarity, analogies, and visualization.

Section 7: Context Matters - Contextual Embeddings

πŸ”„

Static embeddings have a problem: words mean different things in different contexts.

The Polysemy Problem: Why Static Embeddings Fail

Consider the word "bank" - it has completely different meanings in different contexts:

The Polysemy Problem: Same Word, Different Meanings

The Word:

"bank"

Context 1: Financial

"I deposited money at the bank"

Meaning: Financial Institution 🏦

Related: money, deposit, account

Context 2: Geographic

"We sat by the river bank"

Meaning: Edge of River 🏞️

Related: river, shore, water

❌ Static Embeddings Problem

"bank" always gets THE SAME vector regardless of context!

vector("bank") = [0.23, -0.15, 0.89, ...] (always identical)

Cannot distinguish between financial institution and river edge

The Solution: Contextual Embeddings

Models like BERT and GPT solve this by generating different embeddings for the same word based on surrounding context. This is the major breakthrough that enabled modern NLP!
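A toy sketch of the idea, not BERT's actual mechanism: blend a word's static vector with its neighbors' vectors so the same word gets a different representation in each sentence. Real models use learned attention instead of plain averaging, and the vectors here are invented.

```python
# invented 2-D static vectors
static = {
    "bank":  [1.0, 1.0],
    "money": [2.0, 0.0],
    "river": [0.0, 2.0],
}

def contextual(word, sentence):
    # mix the word's own vector with the average of its neighbors' vectors
    neighbors = [static[w] for w in sentence if w != word]
    avg = [sum(dim) / len(neighbors) for dim in zip(*neighbors)]
    return [(a + b) / 2 for a, b in zip(static[word], avg)]

v_financial = contextual("bank", ["money", "bank"])  # pulled toward "money"
v_river     = contextual("bank", ["river", "bank"])  # pulled toward "river"
```

Unlike the single static vector for "bank", the two contextual vectors differ, so a downstream model can tell the financial sense from the geographic one.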

Static vs Contextual Comparison

How Contextual Embeddings Work: The Attention Mechanism

Models like BERT use Transformer architectures with an attention mechanism that allows each word to "look at" all other words in the sentence.

Attention Mechanism

The Power of Attention: In "The movie was not good", the words "not" and "good" have strong attention to each other. This allows the model to understand that "not good" = negative sentiment, something static embeddings could never capture!

Static vs Contextual: Key Differences

πŸ“Š Static Embeddings

Word2Vec, GloVe, FastText

βœ“ Representation: One fixed vector per word

βœ— Polysemy: Cannot distinguish meanings

βœ“ Training: Shallow models, faster

βœ“ Vector arithmetic: Clean linear patterns

πŸ’‘ Use case: Similarity, clustering, fast inference

πŸ€– Contextual Embeddings

BERT, GPT, Modern LLMs

βœ“ Representation: Different vector per context

βœ“ Polysemy: Handles multiple meanings naturally

βœ— Training: Deep Transformers, slower

⚠️ Vector arithmetic: Less consistent (context-dependent)

πŸ’‘ Use case: Complex NLP tasks, fine-tuning

⚠️ CRITICAL: Interpretability Caveat

This is one of the most important concepts to understand when working with embeddings!

Interpretability Caveat: Vector Arithmetic

βœ… Static Embeddings

Word2Vec / GloVe

king - man + woman = queen

Why it works:

β€’ One fixed vector per word

β€’ Trained for semantic relationships

⚠️ Contextual Embeddings

BERT / GPT / Modern LLMs

king - man + woman = ???

Why it's inconsistent:

β€’ Different vector per context

β€’ Trained for task performance

Why Vector Arithmetic Doesn't Transfer to Contextual Models

1. Training Objective: BERT/GPT optimize for masked-token or next-token prediction, NOT explicit semantic relationships like Word2Vec

2. Context Dependence: the same word gets different vectors in different contexts, so no single "king" vector exists to manipulate

3. Design Tradeoff: contextual models prioritize task performance over interpretable linear structure

💡 Use vector arithmetic as pedagogical intuition for static embeddings, not as a universal embedding property!

πŸ”₯

CRITICAL Teaching Point:

The vector arithmetic intuition (king - man + woman β‰ˆ queen) is a beautiful property of static embeddings like Word2Vec, but it does NOT universally transfer to all embedding types!

Why the README emphasizes this: It's tempting to overgeneralize this property to all embeddings, but contextual models work fundamentally differently. Use vector arithmetic as pedagogical intuition for static embeddings, not as a guarantee across all embedding families!

Popular Contextual Models

πŸ€–

Models to explore:

  • BERT (bert-base-uncased): Bidirectional, excellent for understanding tasks
  • GPT (gpt2): Left-to-right, excellent for generation
  • Sentence-BERT (all-MiniLM-L6-v2): Optimized for sentence similarity
πŸ’‘

Key Takeaway

The evolution from static to contextual embeddings solves the polysemy problem. Modern NLP uses contextual embeddings by default, but static embeddings remain valuable for fast, lightweight applications.

Both share the same principle: convert text to dense vectors that capture meaning, but contextual models add context-awareness.

In the notebook, we'll use sentence-transformers to generate contextual embeddings and compare them with static Word2Vec embeddings.
PART 3 OF 7

βš™οΈ Understanding the Mechanics

Now that you know what embeddings are and why they evolved, let's understand how they actually work under the hood.

πŸ”¬

How do embeddings learn?

From random numbers to meaningful vectors through training

🎯

What shapes embeddings?

Training data, objectives, and domain dependencies

βœ‚οΈ

Why tokenization matters?

The critical first step that determines everything

Your Journey Through Part 3

graph TB
    S1["Section 8<br/>⚙️ How Embeddings Emerge<br/>Training process demystified"] --> S2["Section 9<br/>✂️ Tokenization Introduction<br/>The critical first step"]
    S2 --> S3["Section 10<br/>🛠️ Building Tokenizers<br/>Custom tokenization from scratch"]
    S3 --> S4["Section 11<br/>🔗 Tokenizer Impact<br/>Effects on embeddings & tasks"]
    style S1 fill:#F3E5F5,stroke:#9C27B0,stroke-width:3px
    style S2 fill:#E8EAF6,stroke:#7B1FA2,stroke-width:2px
    style S3 fill:#E1BEE7,stroke:#7B1FA2,stroke-width:2px
    style S4 fill:#CE93D8,stroke:#7B1FA2,stroke-width:2px

πŸ’‘ What You'll Understand

βœ“ How embeddings start as random numbers and become meaningful through training

βœ“ Why different training objectives (Word2Vec vs BERT) create different embedding spaces

βœ“ How tokenization choices affect vocabulary size, OOV handling, and sequence length

βœ“ The tradeoffs between word-level, character-level, and subword tokenization

πŸ‘‡ Section 8: How Embeddings Emerge from Training

Section 8: How Embeddings Emerge from Training

πŸ”¬

Embeddings aren't magicβ€”they're learned parameters shaped by data and objectives.

The Learning Process: From Random to Meaningful

Embeddings start as completely random numbers and gradually evolve through training to capture meaningful semantic patterns. Let's visualize this transformation!

From Random Numbers to Meaningful Vectors

❌ Before Training: Random

Similarity("king", "queen") = 0.03 (not similar at all!)

→ Training on millions of examples →

✅ After Training: Meaningful

Similarity("king", "queen") = 0.85 (now very similar, while "apple" remains far from both!)

How It Happens:

1. Similar Contexts: "king" and "queen" appear in similar contexts: "the ___ ruled", "___ of England"

2. Prediction Task: the model learns to predict context from word, and similar contexts need similar vectors!

3. Gradient Updates: backpropagation adjusts vectors, so similar usage → similar vectors

Core Principle: Similar usage patterns in training β†’ Similar vectors in learned space

This is why embeddings capture semantic relationships without explicit programming!

The Magic of Training: Through millions of examples, the model learns that "king" and "queen" appear in similar contexts ("the ___ ruled", "___ of England"). To predict these contexts accurately, their vectors must become similar!

The Training Process (5 Steps)

graph TB A["Step 1: Random Initialization
Each word gets random vector"] --> B["Step 2: Training Objective
Predict context, reconstruct, etc."] B --> C["Step 3: Compute Loss
How wrong are predictions?"] C --> D["Step 4: Backpropagation
Update vectors to reduce loss"] D --> E["Step 5: Repeat
Millions of times"] E --> F["Result: Meaningful Embeddings
Similar words β†’ similar vectors"] style A fill:#F3F2F1,stroke:#0078D4 style F fill:#107C10,stroke:#107C10,color:#fff

Training Dynamics

Embedding Training Loss Curve

Training dynamics: Loss decreases rapidly initially, then converges around epoch 35. Validation loss tracks training closely, indicating good generalization.

Key Insight: What Shapes Embeddings?

Embeddings are NOT universal truth: they are shaped by three key factors during training:

What Shapes Embeddings? Three Key Dependencies

Embeddings are shaped by how they were trained, NOT universal truth

1
Training Data Dependency

The corpus determines patterns learned

β€’ Wikipedia: General world knowledge, formal language

β€’ Medical Journals: Clinical terminology, disease names

β€’ Twitter/Social Media: Informal, slang, abbreviations, emojis

β€’ Programming Repos: Code syntax, technical terms

Same word, different embeddings!

"Python" + snake (0.65) vs "Python" + Java (0.82)

2
Training Objective Dependency

The task determines geometric structure

β€’ Word2Vec: Context prediction β†’ clean linear patterns

β€’ GPT: Next token β†’ generation patterns

β€’ BERT: Masked tokens β†’ contextual understanding

β€’ Sentence-BERT: Similarity β†’ sentence-level clusters

Different objectives = Different geometry!

Word2Vec: king-man+woman βœ“ | BERT: Less reliable β–³

3
Domain Context Dependency

Specialized corpora emphasize domain meanings

β€’ Finance Corpus: "bank" β†’ institution, deposits, loans

β€’ Gaming Corpus: "bank" β†’ money storage, vault

β€’ Geography Corpus: "bank" β†’ river edge, shore, waterside

β€’ General Corpus: "bank" β†’ mixed meanings, less specific

Domain specialization matters!

Finance model: bank + loan (0.89) | Geography: bank + river (0.91)

⚠️ Critical Implication

This is why choosing the RIGHT pre-trained model for your domain matters!

A model trained on Wikipedia β‰  Model trained on Twitter β‰  Model trained on medical texts

πŸ”₯

Critical Implication: This is why choosing the right pre-trained model for your domain matters!

A model trained on Wikipedia will have different embeddings than one trained on Twitter or medical texts, even with the same architecture.

Embedding Dimensions: Finding the Sweet Spot

Embedding Dimensions vs Performance

The sweet spot: 300 dimensions balances accuracy (89%) with reasonable training time (60s). Beyond that, diminishing returns.

Common Training Objectives

Different training objectives lead to different embedding spaces. Here are the most popular approaches:

Context Prediction

Model: Word2Vec (Skip-gram/CBOW)

What It Learns: Words in similar contexts β†’ similar vectors

Example: Predict "cat" from ["The", "sat", "on"]

Co-occurrence Factorization

Model: GloVe

What It Learns: Global word relationships from statistics

Example: "king" and "queen" co-occur often

Masked Language Modeling

Model: BERT

What It Learns: Bidirectional context understanding

Example: Predict "[MASK]" in "The cat [MASK] on mat"

Next Token Prediction

Model: GPT

What It Learns: Left-to-right generation patterns

Example: Given "The cat sat", predict "on"

Contrastive Similarity

Model: Sentence-BERT

What It Learns: Sentence-level semantic similarity

Example: "The movie was great" should be close to "The film was excellent"

πŸ’‘

Key Takeaway

Embeddings are learned, not hand-crafted. They emerge from optimization over large data with specific objectives. Similar usage patterns in training data β†’ similar vectors in learned space.

This is why embeddings can capture nuanced semantic relationships that hand-crafted features miss.

πŸ‘‡ Section 9: Tokenization - The Critical First Step

Section 9: Tokenization - The Critical First Step

βœ‚οΈ

Before we can create embeddings, we must decide: how do we split text into tokens?

Why Tokenization Matters

Tokenization is NOT just preprocessing: it's a critical architectural decision that determines vocabulary, granularity, and ultimately your model's capabilities.

How Tokenization Affects Everything

πŸ“Š Vocabulary Size

Determines model memory and embedding table size

Word: 100K-1M tokens

Subword: 30K-50K tokens

Char: <100 tokens

⏱️ Sequence Length

Affects computational cost and context window

Word: Short sequences

Subword: Medium sequences

Char: Very long sequences

πŸ” OOV Handling

How unknown words are processed

Word: UNK token (loses info)

Subword: Decompose (preserves info)

Char: Always known

πŸ’‘ Semantic Units

What meaningful chunks are preserved

Word: Natural units (best)

Subword: Morphemes (good)

Char: Letters only (weak)

⚠️ Critical Decision: Tokenization is not just preprocessing!

It fundamentally determines model architecture, performance, and behavior

Bad tokenization = Bad embeddings, no matter how good your model is

The Goldilocks Problem

Finding the right tokenization granularity is a balancing act between too coarse, too fine, and just right:

The Goldilocks Problem of Tokenization

Example: "Tokenization is preprocessing"

Too Coarse

Word-Level

Tokens:

Token is prepr...

3 tokens

Pros:

+ Natural semantic units
+ Short sequences

Cons:

- Huge vocabulary (100K+)
- Cannot handle "Tokenizations"
- OOV words become UNK

Problem: Too rigid!

Just Right ⭐

Subword-Level (BPE)

Tokens:

Token ization is pre proc...

6 tokens

Pros:

+ Balanced vocabulary (30K)
+ Handles variations
+ Decomposes unknowns

Cons:

- Slightly longer sequences
- Requires training

Best balance!

Too Fine

Character-Level

Tokens:

T o k e ...

29 tokens

Pros:

+ Tiny vocabulary (<100)
+ No OOV ever

Cons:

- Very long sequences
- Loses semantic chunks
- Model must learn words

Problem: Too granular!

Modern NLP Solution: Subword Tokenization

Balances vocabulary size, sequence length, and OOV handling

Used by: BERT (WordPiece), GPT (BPE), T5 (Unigram)

Typical vocabulary: 30K-50K tokens

Token Length Distribution Across Strategies

Token Length Distribution

Token length distribution by strategy: Character-level produces the longest sequences, word-level the shortest, subword strikes the balance.

Three Tokenization Strategies

Example Text: "Tokenization is preprocessing"

Python
# Word-level tokenization
tokens = ["Tokenization", "is", "preprocessing"]
# → 3 tokens; simple, but what about "Tokenizations" (plural)?

# Subword-level tokenization (BPE-style)
tokens = ["Token", "ization", "is", "pre", "process", "ing"]
# → 6 tokens; handles variations like "Tokenize", "Tokenizer"

# Character-level tokenization
tokens = ["T", "o", "k", "e", "n", "i", "z", "a", "t", "i", "o", "n", " ", "i", "s", ...]
# → 29 tokens; handles any word, but sequences are very long
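The word- and character-level counts above can be verified with the standard library alone (subword counts need a trained tokenizer, so they're omitted here):

```python
text = "Tokenization is preprocessing"

word_tokens = text.split()   # naive word-level tokenization
char_tokens = list(text)     # character-level tokenization

print(len(word_tokens))  # 3
print(len(char_tokens))  # 29
```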

Quick Comparison: Which Strategy When?

Word-Level

Vocab: 100K-1M+ πŸ“ˆ

Seq Length: Short βœ“

OOV: Poor (UNK) ❌

Semantics: Natural βœ“

Use when: Controlled, small vocabulary domains

Subword-Level ⭐

Vocab: 30K-50K βœ“

Seq Length: Medium βœ“

OOV: Good (decompose) βœ“

Semantics: Morphemes βœ“

Use when: General-purpose NLP (RECOMMENDED)

Character-Level

Vocab: <100 βœ“

Seq Length: Very long ❌

OOV: Perfect βœ“βœ“

Semantics: Must learn ⚠️

Use when: Noisy text, misspellings, extreme OOV

Popular Subword Algorithms

Modern NLP uses subword tokenization. Three main algorithms:

πŸ”—
Byte-Pair Encoding (BPE)

Algorithm:
Iteratively merge most frequent character pairs

Example:
"low", "lower", "lowest"
β†’ "l", "o", "w", "e", "r", "s", "t"
β†’ "low", "er", "est"

Used by: GPT, RoBERTa

Strength: Simple, effective, data-driven

πŸ“¦
WordPiece

Algorithm:
Merge based on likelihood increase

Example:
Similar to BPE but uses probability scoring

Used by: BERT

Strength: Slightly better linguistic properties than BPE

🎲
Unigram Language Model

Algorithm:
Start large, prune unlikely subwords

Example:
Probabilistic: can generate multiple tokenizations

Used by: T5, XLNet

Strength: Flexible, handles ambiguity
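The BPE merge loop from the first card fits in a few lines. Here is a simplified sketch of the "low / lower / lowest" example, with `</w>` as an end-of-word marker; real implementations match whole symbols (the naive string replace here can merge across symbol boundaries in pathological cases) and learn tens of thousands of merges, not three.

```python
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Fuse every occurrence of the pair into a single symbol."""
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in vocab.items()}

# Words split into characters, with their corpus frequencies.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "l o w e s t </w>": 2}

merges = []
for _ in range(3):
    counts = pair_counts(vocab)
    best = max(counts, key=counts.get)   # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', '</w>')]
```

The shared stem "low" emerges purely from frequency, which is why BPE handles "Tokenize" / "Tokenizer" style variations gracefully.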

Tokenizer Fragmentation Comparison

Fragmentation comparison: Character-level severely fragments long words, while subword methods (BERT, GPT-2) balance well.

Real Tokenizer Examples

Same Text, Different Tokenizers

Python
text = "The unbelievable performance!"

# BERT (WordPiece):
["The", "un", "##bel", "##iev", "##able", "performance", "!"]
# → splits "unbelievable" into subwords; ## marks a continuation piece

# GPT-2 (BPE):
["The", "Ġun", "bel", "iev", "able", "Ġperformance", "!"]
# → Ġ marks a preceding space

# Character-level:
["T", "h", "e", " ", "u", "n", "b", "e", "l", "i", "e", "v", "a", "b", "l", "e", ...]
# → every character is a separate token

Result: Same text → different token IDs → different embeddings!
πŸ”₯

Critical Consequence:

You MUST use the same tokenizer that was used during model pre-training. Mismatched tokenizers break embeddings!

Example: Don't use GPT-2 tokenizer with BERT embeddings!

In the notebook, we'll compare tokenizers from transformers library (BERT vs GPT-2) on the same text.

Section 10: Building a Tokenizer from Scratch

πŸ› οΈ

When would you build your own tokenizer, and how do you do it?

When to Build Custom Tokenizers

❓ Key Question:

Does a pre-trained tokenizer match your domain well enough?

βœ… Use Pre-trained (Recommended)

When:

  • General domain (news, web text)
  • Well-covered language (English, etc.)
  • Fast iteration needed
  • Limited training data
⚠️ Build Custom (When Necessary)

When:

  • Highly specialized domain
  • Severe vocabulary mismatch
  • Language not well-covered
  • Privacy/compliance needs
Four Scenarios Requiring Custom Tokenizers
πŸ₯
Specialized Domain

Medical/Legal/Code

Problem:

Pre-trained splits domain terms poorly

Solution:

"COVID-19" β†’ one token

🌍
Underrepresented Language

Low-resource languages

Problem:

Existing tokenizers fragment heavily

Solution:

Train on native corpus

πŸ’₯
Vocabulary Mismatch

Severe fragmentation

Problem:

Common words become 5+ tokens

Solution:

Domain-specific vocab

πŸ”’
Privacy/Compliance

Controlled training

Problem:

Cannot use external tokenizers

Solution:

Train on compliant data only

πŸ’‘ Practical Advice

Building custom tokenizers is expensive in time and iteration. Start with pre-trained tokenizers.

Only invest in custom tokenization when measurable performance gaps exist AND domain mismatch is the root cause.

The 6-Step Process

graph TB A["1. Define Corpus/Domain
What data represents your task?"] --> B["2. Normalization Policy
Lowercase? Keep punctuation?"] B --> C["3. Pre-tokenization Strategy
Whitespace? Regex? Language-specific?"] C --> D["4. Subword Algorithm Choice
BPE, WordPiece, or Unigram?"] D --> E["5. Vocabulary Design
Size? Special tokens? Reserved terms?"] E --> F["6. Validation Checks
Coverage? OOV rate? Fragmentation?"] style A fill:#F3F2F1,stroke:#0078D4 style F fill:#107C10,stroke:#107C10,color:#fff

Six Critical Steps with Key Decisions

1
Define Corpus/Domain

What data represents your task?

βœ“ Good: Domain-matched corpus

βœ— Bad: Generic Wikipedia for specialized domain

2
Normalization Policy

How to clean the text?

βœ“ Good: Preserve case when meaningful

βœ— Bad: Lowercase everything blindly

3
Pre-tokenization Strategy

How to split into initial chunks?

βœ“ Good: Regex or language-specific

βœ— Bad: Simple whitespace for all languages

4
Subword Algorithm Choice

Which algorithm to use?

βœ“ Good: BPE, WordPiece, or Unigram

βœ— Bad: Word-level or char-level only

5
Vocabulary Design

What size and special tokens?

βœ“ Good: 30K-50K with domain terms

βœ— Bad: Too small or too large

6
Validation Checks

Does it work well?

βœ“ Good: Check OOV, fragmentation, seq length

βœ— Bad: Skip validation, use blindly

Step-by-Step Details

Step 1: Corpus/Domain Definition

Python
# Bad: Training on Wikipedia for medical NLP
corpus = load_wikipedia()  ❌

# Good: Domain-matched corpus
corpus = load_medical_texts()  βœ…
corpus += load_clinical_notes()
corpus += load_research_papers()

# Result: Vocabulary matches your actual use case

Step 2: Normalization Policy

Python
# Decisions to make:
- Lowercase or preserve case?
  → "Apple" (company) vs "apple" (fruit) - case matters!

- Remove accents/diacritics?
  → "café" → "cafe"? Loss of meaning in some languages

- Handle numbers?
  → "COVID-19" → "COVID " or keep as-is?

- Unicode normalization?
  → Different ways to represent é (e + combining acute accent vs a single precomposed character)
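The Unicode normalization point is easy to demonstrate with the standard library: two strings that render identically can compare unequal until normalized.

```python
import unicodedata

nfc = "caf\u00e9"    # 'é' as one precomposed code point
nfd = "cafe\u0301"   # 'e' followed by a combining acute accent

print(nfc, nfd)                                   # both render as "café"
print(nfc == nfd)                                 # False!
print(unicodedata.normalize("NFC", nfd) == nfc)   # True after normalizing
```

Without a normalization policy, the two forms become different tokens with different embeddings.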

Step 3: Pre-tokenization

Python
# Simplest: whitespace splitting
"Hello world!" → ["Hello", "world!"]

# Better: regex-based
"Hello world!" → ["Hello", "world", "!"]

# Language-specific:
# Chinese/Japanese need word-aware segmentation
"你好世界" → ["你好", "世界"]  (not character-level!)
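The whitespace vs regex split can be reproduced with the standard library; the regex here is a minimal illustration, not a production pre-tokenizer.

```python
import re

text = "Hello world!"

print(text.split())                      # ['Hello', 'world!'] - punctuation stuck to the word
print(re.findall(r"\w+|[^\w\s]", text))  # ['Hello', 'world', '!'] - punctuation separated
```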

Step 4: Algorithm Choice

Python
# Training a BPE tokenizer with the Hugging Face tokenizers library
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(vocab_size=30000, min_frequency=2,
                     special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["medical_corpus.txt"], trainer=trainer)

# Now you have a domain-specific tokenizer!

Step 5: Vocabulary Design

Python
# Key decisions:
1. Vocab size:
   - Too small β†’ over-fragmentation
   - Too large β†’ rare tokens, large embedding matrix

2. Special tokens:
   [PAD], [UNK], [CLS], [SEP], [MASK]

3. Reserved terms (optional):
   Domain-specific entities that should NOT be split
   Example medical: "COVID-19", "MRI", "COPD"

Step 6: Validation

Python
# Check 1: Vocabulary coverage
test_texts = load_test_set()
oov_rate = compute_oov_rate(tokenizer, test_texts)
print(f"OOV rate: {oov_rate:.2%}")  # Goal: < 1%

# Check 2: Fragmentation
examples = ["unbelievable", "preprocessing", "COVID-19"]
for word in examples:
    tokens = tokenizer.encode(word).tokens
    print(f"{word} β†’ {tokens}")
    # Check if reasonable splits

# Check 3: Sequence length
avg_length = compute_avg_tokens(tokenizer, test_texts)
print(f"Average tokens: {avg_length}")  # Goal: reasonable for model
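Check 1 reduces to a few lines. A minimal sketch with a toy whitespace vocabulary; a real check would use your trained tokenizer's vocabulary and a held-out test set.

```python
def oov_rate(vocab, texts):
    """Fraction of whitespace-split tokens missing from the vocabulary."""
    words = [w for text in texts for w in text.split()]
    return sum(w not in vocab for w in words) / len(words)

vocab = {"the", "patient", "has", "fever"}
texts = ["the patient has fever", "the patient has thrombocytopenia"]

print(f"OOV rate: {oov_rate(vocab, texts):.2%}")  # 12.50% (1 of 8 words unknown)
```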
πŸ”₯

Practical Advice:

Building custom tokenizers is expensive in time and iteration. Start with pre-trained tokenizers for general domains.

Only invest in custom tokenization when measurable performance gaps exist AND domain mismatch is the root cause.

Section 11: How Tokenizer Choices Affect Embeddings and Tasks

πŸ”—

Tokenization isn't just preprocessing: it directly impacts embedding quality and task performance.

Four Critical Effects of Tokenization

1
Granularity β†’ Semantic Cohesion

Over-fragmentation weakens meaning

❌ Bad: "unbelievable" β†’ 12 character tokens

["u","n","b","e","l","i","e","v","a","b","l","e"]

βœ… Good: "unbelievable" β†’ 2 subword tokens

["un", "believable"]

πŸ’‘ Better semantics β†’ Better embeddings

2
Coverage β†’ OOV Handling

Domain mismatch causes fragmentation

❌ Bad: General tokenizer

"thrombocytopenia" β†’ 8 fragments

βœ… Good: Medical tokenizer

"thrombocytopenia" β†’ 1 token

πŸ’‘ Better coverage β†’ Better task performance

3
Length β†’ Computational Cost

More tokens = quadratic attention cost

❌ Bad: Over-fragmented

15 tokens β†’ 15Β² = 225 operations

βœ… Good: Well-designed

8 tokens β†’ 8Β² = 64 operations

πŸ’‘ 3.5x speedup + memory savings

4
Transfer β†’ Pretraining Alignment

Different tokenizer = embeddings don't transfer

❌ Bad: Custom tokenizer + BERT model

Token IDs don't match β†’ garbage!

βœ… Good: BERT tokenizer + BERT model

Token IDs match β†’ works!

πŸ’‘ Token ID alignment is CRITICAL

Detailed Examples of Each Effect

Effect 1: Granularity β†’ Semantic Cohesion

Python
# Over-fragmentation weakens meaning
Word: "unbelievable"

# Bad tokenization (char-level):
["u","n","b","e","l","i","e","v","a","b","l","e"]
β†’ Model must learn from scratch that these chars = concept

# Good tokenization (subword):
["un", "believable"] or ["unbeliev", "able"]
β†’ Model sees morphological structure

Impact: Better semantics β†’ better embeddings

Effect 2: Coverage β†’ OOV Handling

Python
# Domain: Medical NLP
Text: "Patient has thrombocytopenia"

# General tokenizer:
["Patient", "has", "th", "##rom", "##bo", "##cy", "##top", "##enia"]
β†’ 8 fragments! Medical term lost

# Medical tokenizer:
["Patient", "has", "thrombocytopenia"]
β†’ 3 tokens, medical term preserved

Impact: Better domain coverage β†’ better task performance

Effect 3: Length β†’ Computational Cost

Python
# Same text, different tokenizers
Text: "The patient presented with severe symptoms"

# Over-fragmenting tokenizer:
β†’ 15 tokens β†’ 15Β² attention matrix = 225 operations

# Well-designed tokenizer:
β†’ 8 tokens β†’ 8Β² = 64 operations

Impact: 3.5x speedup! (plus memory savings)

Effect 4: Transfer β†’ Pretraining Alignment

Python
# Scenario: Fine-tuning BERT

# ❌ Wrong: Use different tokenizer
custom_tokenizer = train_bpe(my_data)
bert_model = load_bert()
# Token IDs don't match β†’ embeddings are garbage!

# βœ… Right: Use BERT's tokenizer
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_model = AutoModel.from_pretrained("bert-base-uncased")
# Token IDs match β†’ embeddings transfer correctly

Tokenization Impact Across Different Tasks

πŸ“Š
Classification
Impact: Moderate-High

65% Severity

Critical Factors:

  • Semantic cohesion of tokens
  • Domain vocabulary coverage
  • Handling of rare/OOV terms

Sentiment: "not good" β†’ needs proper boundaries

πŸ”
Retrieval/Search
Impact: Very High

85% Severity

Critical Factors:

  • Term matching precision
  • Vocabulary overlap
  • Query-document alignment

Search: "COVID-19" must match exactly

✍️
Text Generation
Impact: Critical

95% Severity

Critical Factors:

  • Fluency depends on boundaries
  • Word formation quality
  • Coherent token sequences

GPT: Poor tokenization β†’ broken words

🏷️
NER/Token Classification
Impact: Critical

95% Severity

Critical Factors:

  • Entity boundaries align with tokens
  • No mid-entity fragmentation
  • Consistent entity representation

NER: "New York" β†’ must stay together

⚠️ Critical Insight

Tokenization is NOT just preprocessing: it's a representation design decision

Bad tokenization β†’ fragmented semantics β†’ weak embeddings β†’ poor task performance

This happens regardless of model architecture quality!

πŸ’‘

Key Takeaway

Tokenization is NOT just preprocessing: it's a representation design decision. It determines vocabulary, granularity, sequence length, and ultimately embedding quality.

Bad tokenization β†’ fragmented semantics β†’ weak embeddings β†’ poor task performance, regardless of model architecture!

In the notebook, we'll analyze tokenization impact by comparing BERT tokenizer vs custom BPE on domain-specific text.
PART 4 OF 7

🎯 Making Practical Decisions

Theory is great, but now comes the real question: What should YOU actually do?

πŸ€”

Direct embeddings or fine-tune?

When to use embeddings as-is vs training

πŸ—οΈ

Build or use pretrained?

Should you train from scratch or leverage existing models?

⚠️

What to watch out for?

Common pitfalls and how to avoid them

Decision Framework

graph LR S1["Section 12
🎯 Direct vs Fine-Tuned
Decision matrix"] --> S2["Section 13
πŸ—οΈ Scratch vs Pretrained
When to build"] S2 --> S3["Section 14
⚠️ Practical Pitfalls
What to avoid"] style S1 fill:#FFF3E0,stroke:#E65100,stroke-width:3px style S2 fill:#FFE0B2,stroke:#E65100,stroke-width:2px style S3 fill:#FFCCBC,stroke:#E65100,stroke-width:2px

βœ… Clear Decisions Ahead

βœ“ A practical decision matrix for choosing between direct embeddings and fine-tuning

βœ“ When building from scratch makes sense (spoiler: almost never!)

βœ“ Common pitfalls that waste time and how to avoid them

Section 12: When to Use Embeddings Directly

🎯

Should you use embeddings as-is, or fine-tune a full model?

Decision Matrix

| Use Case | Direct Embeddings | Fine-Tuned Model | Recommended Approach |
|---|---|---|---|
| Semantic search / RAG | ✅ Perfect fit | ❌ Overkill | Direct embeddings (sentence-transformers) |
| Clustering / Topic grouping | ✅ Fast and effective | ❌ Not needed | Direct embeddings |
| Near-duplicate detection | ✅ Cosine similarity works | ❌ Expensive | Direct embeddings |
| Simple classification (small data) | ✅ Good baseline | ⚠️ May overfit | Start with embeddings + simple classifier |
| Token-level tasks (NER) | ❌ No token boundaries | ✅ Required | Fine-tuned model |
| Generation tasks | ❌ Not applicable | ✅ Required | Full generative model |
| Complex classification (large data) | ⚠️ May plateau | ✅ Better accuracy | Fine-tune if embeddings underperform |
| Highly specialized domain | ⚠️ If pretrained fits | ✅ If domain gap large | Depends on domain mismatch severity |

Practical Heuristic

βœ…

The Default Path:

  1. Start with direct embeddings for fast iteration
  2. Evaluate performance and error patterns
  3. Move to fine-tuning ONLY when error analysis shows representation limits

Most tasks don't need fine-tuning. Save time and compute for when it actually matters.
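The "direct embeddings" path for search and similarity tasks is just nearest-neighbor lookup over vectors. A minimal sketch with made-up 3-dimensional document vectors standing in for real model outputs (in practice these would come from an embedding model such as sentence-transformers):

```python
import numpy as np

# Hypothetical document embeddings; the values are invented for illustration.
docs = ["refund policy", "shipping times", "account deletion"]
doc_vecs = np.array([[0.9, 0.1, 0.0],
                     [0.1, 0.9, 0.1],
                     [0.0, 0.1, 0.9]])
doc_vecs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)  # unit-normalize rows

# Embed the query the same way, then rank documents by cosine similarity.
query_vec = np.array([0.8, 0.2, 0.0])
query_vec = query_vec / np.linalg.norm(query_vec)

scores = doc_vecs @ query_vec          # cosine similarity per document
best = int(np.argmax(scores))
print(docs[best])                      # "refund policy" is the closest match
```

No training happens here at all, which is exactly why this path iterates so quickly.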

Section 13: Build from Scratch vs Use Pretrained

πŸ—οΈ

Should you train your own tokenizer and embeddings, or use pretrained?

The Decision Framework

graph TD A[Start] --> B{Is your domain
close to general language?} B -->|Yes| C[Use Pretrained] B -->|No| D{Do you have
large domain corpus?} D -->|No| C D -->|Yes| E{Is performance gap
measurable and large?} E -->|No| C E -->|Yes| F{Do you have
time & compute budget?} F -->|No| C F -->|Yes| G[Consider Custom Training] style C fill:#107C10,stroke:#107C10,color:#fff style G fill:#F7630C,stroke:#F7630C,color:#fff

When to Use Pretrained

πŸš€

Default to pretrained when:

  • Domain is general or mainstream (news, social media, web text)
  • Speed and baseline quality are priority
  • Labeled data is limited
  • Team lacks NLP infrastructure expertise
  • Compute/time budget is constrained

Recommended models: sentence-transformers, BERT variants, GPT variants

When to Consider Custom Training

⚠️

Only consider custom when:

  • Domain language is highly specialized (medical, legal, scientific, code)
  • Vocabulary mismatch causes severe over-fragmentation
  • Compliance/privacy requires controlled training pipelines
  • Long-term product value justifies maintenance cost
  • Measurable performance gap exists AND domain mismatch is root cause

Custom training is expensive: requires data, compute, expertise, and ongoing maintenance.

Section 14: What to Watch Out For

⚠️

Representation choices are never neutral. Here's what to evaluate:

Design Considerations

| Factor | Why It Matters | Tradeoff |
|---|---|---|
| Token granularity | Word vs subword vs char affects semantics | Coarse = simpler but less flexible; fine = flexible but longer sequences |
| Normalization rules | Case, punctuation, numbers affect meaning | Aggressive = cleaner but loses nuance; minimal = preserves signal but noisy |
| Domain vocabulary coverage | OOV tokens break semantics | General model = broad but shallow; domain model = deep but narrow |
| Sequence length | Longer sequences = more compute/memory | Long context = better understanding but slower; short context = faster but may truncate |
| Task dependency | Classification vs retrieval vs generation | Task-specific optimization vs general purpose |

Common Pitfalls to Avoid

🚫

Don't Do These:

  • Fitting vectorizers on full data: Use train split only to avoid data leakage!
  • Over-cleaning text: Removing "not" or punctuation can reverse sentiment
  • Tokenizer mismatch: Don't use GPT-2 tokenizer with BERT embeddings
  • Expecting Word2Vec arithmetic everywhere: Contextual embeddings don't work the same way
  • Ignoring Unicode issues: Encoding problems create garbage tokens
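The first pitfall in a nutshell: build the vocabulary from the training split only, so test-time words the model never saw fall into an UNK bucket instead of silently leaking into the features. A toy count vectorizer to make the point (sklearn's TfidfVectorizer enforces the same discipline through its fit/transform split):

```python
train_texts = ["great movie", "terrible plot"]
test_texts = ["great acting"]

# Fit: the vocabulary comes from the TRAIN split only (no leakage).
vocab = sorted({w for text in train_texts for w in text.split()})

# Transform: count known words; unseen words map to a trailing UNK count.
def to_counts(text, vocab):
    words = text.split()
    return [words.count(w) for w in vocab] + [sum(w not in vocab for w in words)]

print(vocab)                             # ['great', 'movie', 'plot', 'terrible']
print(to_counts(test_texts[0], vocab))   # [1, 0, 0, 0, 1] - 'acting' lands in UNK
```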
πŸ’‘

Key Takeaway

Every representation decision is a tradeoff. There's no universal "best" tokenizer or embedding model. The right choice depends on your domain, task, data, and constraints.

Start practical: use pretrained, iterate fast, and only invest in custom solutions when measurable gaps justify the cost.

PART 5 OF 7

πŸ“‹ Practical Guidelines

From raw text to production: concrete, actionable recommendations for building text representation pipelines.

🎯 Your End-to-End Workflow

1️⃣

Understand Task

2️⃣

Text EDA

3️⃣

Choose Strategy

4️⃣

Select Tokenizer

5️⃣

Preprocess

6️⃣

Generate Vectors

7️⃣

Build Baseline

8️⃣

Iterate!

πŸ’‘ Core Principle

Start simple, iterate based on errors. Don't jump to complex solutions before understanding where simple approaches fail.

Section 15: Recommended Workflow

πŸ“‹

From raw text to production: a practical step-by-step guide.

The 8-Step Workflow

graph TB A["1. Understand Your Task
Classification? Retrieval? Generation?"] --> B["2. Perform Text EDA
Length, vocabulary, quality checks"] B --> C["3. Choose Representation Strategy
Sparse (BoW/TF-IDF) or Dense (embeddings)?"] C --> D["4. Select Tokenizer
Match to model if using pretrained"] D --> E["5. Apply Preprocessing
Normalize, clean (minimal!), handle special cases"] E --> F["6. Generate Representations
Vectors for training"] F --> G["7. Build Baseline Model
Start simple"] G --> H["8. Iterate Based on Errors
Analyze failures, improve representations"] H -.->|If needed| C style A fill:#F3F2F1,stroke:#0078D4 style H fill:#107C10,stroke:#107C10,color:#fff

Step-by-Step Details

Step 1 & 2: Task + EDA

Python
# Understand your task
task = "sentiment classification"
metric = "F1-score"

# Text EDA essentials
import pandas as pd

df = load_data()
print(df.describe())

# Check text length distribution
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()

# Vocabulary richness
unique_words = set(' '.join(df['text']).split())
print(f"Vocabulary size: {len(unique_words)}")

# Class balance
print(df['label'].value_counts())

Step 3-5: Representation Strategy

Python
# Start with simplest that might work
from sklearn.feature_extraction.text import TfidfVectorizer

# Option A: TF-IDF baseline
vectorizer = TfidfVectorizer(max_features=5000, min_df=2)
X_train = vectorizer.fit_transform(train_texts)  # fit on train only!
X_test = vectorizer.transform(test_texts)

# Option B: Pretrained embeddings
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
X_train = model.encode(train_texts, show_progress_bar=True)
X_test = model.encode(test_texts)

# Compare both approaches!

Quick Start Guide: Which Representation Strategy?

TF-IDF

Sparse, Simple, Fast

When to Use:

  • Small-medium datasets
  • Text classification
  • Need interpretability
  • Limited compute

βœ“ Pros:

Very fast, interpretable, no pretrained model needed

βœ— Cons:

No semantics, high dimensionality, OOV issues

Quick Start:

TfidfVectorizer()

πŸ’‘ Best for baselines

⭐ RECOMMENDED
Sentence Embeddings

Dense, Semantic, Pretrained

When to Use:

  • Semantic search/RAG
  • Clustering
  • Similarity tasks
  • Medium-large datasets

βœ“ Pros:

Captures semantics, low dimensionality, handles OOV well

βœ— Cons:

Less interpretable, slower than TF-IDF

Quick Start:

SentenceTransformer("all-MiniLM-L6-v2")
Fine-Tuned Models

Task-Specific, Powerful

When to Use:

  • NER, token classification
  • Generation tasks
  • Complex classification
  • Large labeled datasets

βœ“ Pros:

Best accuracy, task-adapted, contextual

βœ— Cons:

Slow training, needs labeled data, high compute cost

Quick Start:

AutoModel.from_pretrained("bert-base")

🎯 When max accuracy needed

🎯 Decision Rule

Start with TF-IDF baseline β†’ Try Sentence Embeddings for semantic tasks β†’ Fine-tune only when needed

Most problems don't need fine-tuning! Embeddings work great for 80%+ of use cases.

Quick Reference: Model Selection by Task

| Task | Recommended Starting Point | When to Upgrade |
|---|---|---|
| Text classification | TF-IDF + Logistic Regression | Baseline < 80% accuracy |
| Semantic search | sentence-transformers | Rare; already near-optimal |
| Clustering | Averaged Word2Vec or GloVe | Clusters not semantically coherent |
| NER / Token tasks | Fine-tuned BERT | N/A (start with full model) |
βœ…

Best Practices:

  • Version everything: Tokenizer, model, preprocessing pipeline
  • Monitor OOV rate: High OOV = representation problem
  • Check sequence length: Truncation = information loss
  • Validate on held-out data: Avoid overfitting to test set
  • Error analysis first: Before scaling up, understand failures
The hands-on notebook implements this full workflow on the movie reviews dataset, from EDA through multiple representation strategies to final evaluation.
PART 6 OF 7

πŸ”— Bringing It All Together

Connect all the pieces: from raw text through embeddings to modern LLMs. See the complete picture.

βš™οΈ

The Complete Pipeline

Every step from text to predictionsβ€”how it all connects

πŸš€

Modern LLMs Connection

How everything applies to GPT, BERT, and Claude

Connecting the Dots

graph LR S1["Section 16
βš™οΈ Complete Pipeline
End-to-end flow"] --> S2["Section 17
πŸš€ Modern LLMs
Evolution & connection"] style S1 fill:#F3E5F5,stroke:#7B1FA2,stroke-width:3px style S2 fill:#E1BEE7,stroke:#7B1FA2,stroke-width:2px

πŸ’‘ The Big Picture

Modern LLMs didn't replace the fundamentals: they automated and scaled them. Understanding the pipeline gives you the foundation to use any NLP system effectively.

Section 16: The Full Chain - Text to Embeddings to Tasks

πŸ”—

Every NLP system follows the same fundamental flow.

The Universal NLP Pipeline

graph TB A["Raw Text
'This movie was great!'"] --> B["Preprocessing
Normalize, clean"] B --> C["Tokenization
['This', 'movie', 'was', 'great', '!']"] C --> D["Token IDs
[2023, 3544, 2001, 2307, 999]"] D --> E["Embeddings
Lookup or generate vectors"] E --> F["Model Processing
Transformers, classifiers, etc."] F --> G["Task Output
Positive sentiment, 0.92 confidence"] style A fill:#F3F2F1,stroke:#0078D4 style E fill:#5C2D91,stroke:#5C2D91,color:#fff style G fill:#107C10,stroke:#107C10,color:#fff

Detailed Example: End-to-End

Sentiment Classification Pipeline

Python
# Step 1: Raw text
text = "This movie was not great, but I still enjoyed it!"

# Step 2: Preprocessing (minimal! bert-base-uncased lowercases anyway)
text = text.lower()

# Step 3: Tokenization
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize(text)
# ['this', 'movie', 'was', 'not', 'great', ',', 'but', 'i', 'still', 'enjoyed', 'it', '!']

# Step 4: Convert to token IDs (as a batch of one)
import torch
token_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
# e.g. [[2023, 3544, 2001, 2025, 2307, 1010, 2021, 1045, 2145, 5632, 2009, 999]]

# Step 5: Look up the (static) input embeddings
from transformers import AutoModel
model = AutoModel.from_pretrained("bert-base-uncased")
embeddings = model.embeddings.word_embeddings(token_ids)
# Shape: [1, 12 tokens, 768 dimensions]

# Step 6: Model processing (contextual attention)
outputs = model(input_ids=token_ids)
sentence_embedding = outputs.last_hidden_state.mean(dim=1)  # pool over tokens
# Shape: [1, 768] - a single vector for the entire sentence

# Step 7: Task-specific layer (must be trained on labeled data)
import torch.nn as nn
classifier = nn.Linear(768, 2)  # 2 classes: neg/pos
logits = classifier(sentence_embedding)
prediction = torch.softmax(logits, dim=-1)
# After training: e.g. [0.15, 0.85] → 85% positive

# A trained model understands "not great, but...enjoyed" = overall positive!

Why Each Step Matters

Step | Purpose | Impact if Done Wrong
Preprocessing | Normalize noise without losing signal | Over-clean β†’ lose meaning; under-clean β†’ noisy patterns
Tokenization | Split text into learnable units | Bad splits β†’ fragmented semantics, OOV issues
Token IDs | Convert symbols to integers | Mismatched vocabulary β†’ garbage lookups
Embeddings | Convert IDs to dense semantic vectors | Poor embeddings β†’ model can't learn patterns
Model | Learn task-specific patterns | Wrong architecture β†’ suboptimal performance
Output | Map learned patterns to task predictions | Misaligned objective β†’ learns the wrong thing
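The "mismatched vocabulary β†’ garbage lookups" row is easy to demonstrate. This sketch uses a tiny hypothetical vocabulary (not any real tokenizer's table) to show how unseen tokens silently collapse to `[UNK]`:

```python
# Toy vocabulary, standing in for a tokenizer's token-to-ID table
vocab = {"[UNK]": 0, "this": 1, "movie": 2, "was": 3, "great": 4}

def tokens_to_ids(tokens, vocab):
    # Any token the vocabulary has never seen collapses to [UNK]
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

ids = tokens_to_ids(["this", "movie", "was", "fantastic"], vocab)
print(ids)  # [1, 2, 3, 0] - "fantastic" is lost as [UNK]
```

This is why subword tokenizers (BPE/WordPiece) were invented: they split rare words into known pieces instead of discarding them.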
πŸ”₯

The Garbage In, Garbage Out Principle:

Every step builds on the previous one. Bad tokenization β†’ bad embeddings β†’ bad model, regardless of how sophisticated your architecture is.

This is why we spent so much time on representations!

Section 17: How This Connects to Modern LLMs

πŸš€

Everything we learned applies to GPT, BERT, and modern Transformers.

The Evolution Timeline

2000-2012
Classical Era

Sparse Vectors

Examples: BoW, TF-IDF, N-grams

βœ“ Works: Fast, interpretable

βœ— Fails: No semantics, high dimensionality

2013-2017
Dense Embedding Era

Static Dense

Examples: Word2Vec, GloVe, FastText

βœ“ Works: Semantics! Low dimension

βœ— Fails: One vector per word, polysemy

2018-2019
Contextual Era

Contextual Dense

Examples: BERT, ELMo, GPT-2

βœ“ Works: Context-aware, transfer learning

βœ— Fails: Slow, max sequence length

2020-Now
LLM Era

Massive Scale

Examples: GPT-3/4, Claude, Llama

βœ“ Works: Few-shot, emergent abilities

βœ— Fails: Huge compute, hallucinations
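The sparse-to-dense shift in this timeline can be sketched in a few lines of plain Python. The embedding values below are hand-picked toy numbers, not real learned Word2Vec/GloVe vectors:

```python
# Sparse era: one dimension per vocabulary word, mostly zeros
vocab = ["bad", "great", "loved", "movie", "terrible", "the", "was"]

def bow_vector(tokens):
    # Bag-of-Words: count how often each vocabulary word appears
    return [tokens.count(word) for word in vocab]

print(bow_vector("the movie was great".split()))
# [0, 1, 0, 1, 0, 1, 1] - grows with vocabulary size, no notion of similarity

# Dense era: a fixed, small vector per word; similar words get similar vectors
# (toy hand-picked values for illustration only)
embedding = {
    "great":    [0.8, 0.1],
    "loved":    [0.7, 0.2],
    "terrible": [-0.8, 0.1],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

print(dot(embedding["great"], embedding["loved"]))     # high: similar meaning
print(dot(embedding["great"], embedding["terrible"]))  # low: opposite meaning
```

The dense representation is what makes "semantics" possible: similarity becomes a geometric property of the vectors rather than an exact string match.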

What Stayed the Same

βœ…

Core principles unchanged:

  • Text must still become numbers (tokenization + IDs)
  • Models still learn through embeddings (now in embedding layers)
  • Tokenizer quality still matters (BPE/WordPiece still used)
  • Domain mismatch still hurts performance
  • Garbage in, garbage out still applies

What Changed

🎯

Modern improvements:

  • Scale: Billions of parameters, trained on trillions of tokens
  • Context: Contextual embeddings by default (BERT/GPT)
  • Transfer: Pre-training + fine-tuning paradigm
  • Architecture: Transformers with self-attention
  • Flexibility: Same model for many tasks (prompt engineering)

The Transformer Era Pipeline

Modern LLM Workflow

Python
# Using a modern LLM (e.g., GPT or BERT)
from transformers import pipeline

# Step 1: Load pretrained model (includes tokenizer + embeddings + model)
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

# Step 2: Just pass text!
result = classifier("This movie was not great, but I still enjoyed it!")
# e.g. [{'label': 'POSITIVE', 'score': 0.92}] (exact score will vary)

# Behind the scenes:
# 1. Text β†’ tokenizer (BPE/WordPiece)
# 2. Tokens β†’ IDs (vocabulary lookup)
# 3. IDs β†’ embeddings (learned embedding layer)
# 4. Embeddings β†’ Transformer layers (self-attention)
# 5. Output β†’ task head (classification)

# Same pipeline we learned, now automated and scaled!
πŸ’‘

Key Takeaway

Modern LLMs didn't eliminate the need for understanding representations. They automated and improved the pipeline, but the fundamentals remain:

  • Text β†’ Tokens β†’ IDs β†’ Embeddings β†’ Model β†’ Output
  • Quality at each step determines final performance
  • Domain knowledge and preprocessing still matter

Now you understand what's happening inside the black box!

PART 7 OF 7 - FINAL

πŸ’» From Theory to Practice

You've learned the fundamentals. Now it's time to get hands-on with real code and data!

πŸŽ“ Your Learning Journey

βœ…

Part 1
Foundation

βœ…

Part 2
Evolution

βœ…

Part 3
Mechanics

βœ…

Part 4
Decisions

βœ…

Part 5
Guidelines

βœ…

Part 6
Integration

🎯

Part 7
Practice!

Theory complete! Now apply everything in the hands-on notebook.

πŸš€ What You'll Build

πŸ“Š

Text EDA

πŸ”§

Preprocessing

🎯

Representations

What's in the Notebook

πŸ“Š
Text EDA

Explore the NLTK movie reviews dataset

  • Length distributions
  • Vocabulary analysis
  • Class balance checks
  • Data quality signals
πŸ”§
Preprocessing with NLTK

Hands-on preprocessing pipeline

  • Tokenization strategies
  • Stopword removal
  • Lemmatization
  • Comparison of approaches
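Conceptually, the preprocessing steps above reduce to a few lines. This sketch hand-rolls crude versions (regex tokenization, a tiny made-up stopword set) so it runs without NLTK's downloadable data; the notebook uses the real NLTK equivalents:

```python
import re

# Tiny hand-picked stopword set; NLTK's English list is much larger
STOPWORDS = {"the", "a", "i", "it", "was"}

def preprocess(text):
    tokens = re.findall(r"[a-z']+", text.lower())     # crude word tokenization
    return [t for t in tokens if t not in STOPWORDS]  # stopword removal

print(preprocess("The movie was great, I loved it!"))
# ['movie', 'great', 'loved']
```

Note what was thrown away: for sentiment, dropping "was" and "it" is harmless, but a stopword list that included "not" would destroy the signal, which is why the notebook compares approaches.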
🎯
Multiple Representations

Compare text-to-number methods

  • BoW with CountVectorizer
  • TF-IDF with TfidfVectorizer
  • Word2Vec embeddings
  • Sentence-BERT embeddings
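As a preview of the BoW vs TF-IDF comparison, here is a minimal scikit-learn sketch on two toy reviews. The point to notice: TF-IDF down-weights "the", which appears in every document, relative to a distinctive word like "great":

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the movie was great", "the movie was terrible"]

# BoW: raw counts, one column per vocabulary word
bow = CountVectorizer()
counts = bow.fit_transform(docs)
print(counts.shape)  # (2, 5) - 2 documents, 5 unique words

# TF-IDF: counts reweighted by how document-specific each word is
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs).toarray()
i_great = tfidf.vocabulary_["great"]
i_the = tfidf.vocabulary_["the"]
print(weights[0, i_great] > weights[0, i_the])  # True: "great" is distinctive
```

Both representations are still sparse and context-free; the Word2Vec and Sentence-BERT sections of the notebook show what dense and contextual vectors add on top.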

Learning Objectives (Revisited)

By completing the notebook, you'll be able to:

  1. βœ… Explain how models learn from numbers and why text needs encoding
  2. βœ… Perform structured text EDA before modeling
  3. βœ… Apply and compare preprocessing strategies with NLTK
  4. βœ… Convert text into multiple numeric representations
  5. βœ… Interpret embedding behavior with correct caveats
  6. βœ… Compare tokenizers (BERT vs GPT-2)
  7. βœ… Decide when direct embeddings are appropriate
  8. βœ… Describe the full chain: text β†’ numbers β†’ model β†’ embeddings

Next Steps: Transformers

πŸš€

You're ready for the next module!

With this foundation, you can now dive into:

  • Transformer architecture: Self-attention, positional encoding, encoder-decoder
  • Training objectives: Masked LM, causal LM, seq2seq
  • Fine-tuning strategies: Full fine-tuning vs LoRA vs prompt tuning
  • Modern LLM applications: RAG, agents, tool use

You now understand what happens before the Transformer: the tokenization and embedding layers that feed into attention mechanisms.

πŸŽ“

Final Takeaway

You've learned the foundational journey:

Raw Text β†’ Preprocessing β†’ Tokens β†’ Token IDs β†’ Embeddings β†’ Model β†’ Predictions

This isn't just history; it's the current reality inside every NLP system, from simple classifiers to cutting-edge LLMs.

Now you're equipped to work with text data thoughtfully, make informed representation choices, and understand what modern NLP tools are doing under the hood.

Ready to Code?

Open the Jupyter notebook and start building your first NLP pipeline!

πŸ““ nltk_text_preprocessing_hands_on.ipynb